Pearson Correlation

One of the most common errors found in the media is confusing correlation with causation in scientific and health-related studies. In theory, the two are easy to distinguish: an action or occurrence can cause another (smoking causes lung cancer), or it can merely correlate with another (smoking is correlated with alcoholism). If one action causes another, then they are most certainly correlated. But just because two things occur together does not mean that one caused the other, even if it seems to make sense.

One way to get a general idea about whether or not two variables are related is to plot them on a “scatter plot”. If the dots on the scatter plot tend to go from the lower left to the upper right, it means that as one variable goes up the other variable tends to go up also. This is called a “direct (or positive) relationship.” On the other hand, if the dots tend to go from the upper left corner to the lower right corner of the scatter plot, it means that as values on one variable go up, values on the other variable go down. This is called an “indirect (or negative) relationship.”

Karl Pearson developed a way to calculate a summary number that answers the question “How strong is the relationship between two variables?” The statistic was named after him. It is called Pearson’s Correlation Coefficient (r).
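The step-by-step directions in the next section appear to follow the computational (raw-score) form of the formula. Assuming that form, and writing n for the number of X, Y pairs, it can be written as:

```latex
r = \frac{\sum XY - \dfrac{(\sum X)(\sum Y)}{n}}
         {\sqrt{\left(\sum X^2 - \dfrac{(\sum X)^2}{n}\right)
                \left(\sum Y^2 - \dfrac{(\sum Y)^2}{n}\right)}}
```

The numerator measures how X and Y vary together, and the denominator scales that by how much each variable varies on its own, which is what keeps r between -1.00 and +1.00.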

Calculating Pearson Correlation Coefficient

Step By Step Directions for Calculating a Pearson's r

1. Create a table with one row for each pair of scores and five columns: X, Y, X², Y², and XY. Leave a final row for the column totals. Fill in your values for each variable. One of the variables is designated as X and the other is designated as Y.

2. Calculate and fill in the X² and Y² values.

3. Multiply each X score by its paired Y score, which will give you the cross-products of X and Y.

4. Fill in the last row of the table, which contains all of your “Sum Of” statements. In other words, add up all of the X scores to get ΣX, all of the X² scores to get ΣX², and so on for the other columns.

5. Enter the numbers you have calculated in the spaces where they should go in the formula.

6. Multiply (ΣX)(ΣY) in the numerator (the top part of the formula) and square ΣX and ΣY to get (ΣX)² and (ΣY)² in the denominator (the bottom part of the formula).

7. Do the divisions by n in the formula.

8. Do the subtractions in the formula.

9. Multiply the numbers in the denominator.

10. Take the square root of the denominator.

11. Take the last step and divide the numerator by the denominator and you will get the Correlation Coefficient!
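The steps above can be sketched in Python. This is a minimal illustration, assuming the computational (raw-score) form of the formula; the function name `pearson_r` and the data values are made up for the example and are not from the original worksheet.

```python
import math


def pearson_r(xs, ys):
    """Compute Pearson's r using the column-totals method from the steps."""
    if len(xs) != len(ys):
        raise ValueError("X and Y must contain the same number of scores")
    n = len(xs)

    # Steps 1-4: build the X, Y, X^2, Y^2 and XY columns and total each one.
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))

    # Steps 5-8: enter the totals, multiply and square, divide by n, subtract.
    numerator = sum_xy - (sum_x * sum_y) / n
    x_part = sum_x2 - (sum_x ** 2) / n
    y_part = sum_y2 - (sum_y ** 2) / n

    # Steps 9-10: multiply the two denominator parts, then take the square root.
    denominator = math.sqrt(x_part * y_part)
    if denominator == 0:
        raise ValueError("r is undefined when a variable has no variation")

    # Step 11: divide the numerator by the denominator.
    return numerator / denominator


# Hypothetical data: five pairs of scores.
x_scores = [1, 2, 3, 4, 5]
y_scores = [2, 4, 5, 4, 5]

r = pearson_r(x_scores, y_scores)
print(round(r, 3))  # 0.775
```

For this made-up data the totals are ΣX = 15, ΣY = 20, ΣX² = 55, ΣY² = 86 and ΣXY = 66, so the numerator is 6 and the denominator is the square root of 60.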

Important Things Correlation Coefficients Tell You

It Tells You The Direction Of A Relationship:

If your correlation coefficient is a negative number you can tell, just by looking at it, that there is an indirect, negative relationship between the two variables. As you may recall, a negative relationship means that as values on one variable increase (go up) the values on the other variable tend to decrease (go down) in a predictable manner. If your correlation coefficient is a positive number, then you know that you have a direct, positive relationship. This means that as one variable increases (or decreases) the values of the other variable tend to go in the same direction in a predictable manner. If one increases, so does the other. If one decreases, so does the other.

Correlation Coefficients Always Fall Between -1.00 and +1.00:

A correlation coefficient of -1.00 tells you that there is a perfect negative relationship between the two variables. This means that as values on one variable increase there is a perfectly predictable decrease in values on the other variable. In other words, as one variable goes up, the other goes down. A correlation coefficient of +1.00 tells you that there is a perfect positive relationship between the two variables. This means that as values on one variable increase there is a perfectly predictable increase in values on the other variable. In other words, as one variable goes up so does the other. A correlation coefficient of 0.00 tells you that there is a zero correlation, or no linear relationship, between the two variables. In other words, as one variable changes (goes up or down) you can’t really say anything about what happens to the other variable.
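The three landmark values can be checked with a short Python sketch. The helper `pearson_r` and all of the data below are hypothetical and chosen so the arithmetic comes out exactly; the helper uses the deviations-from-the-mean form of the formula, which gives the same r as the raw-score form.

```python
import math


def pearson_r(xs, ys):
    """Pearson's r computed from deviations about each variable's mean."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products of deviations, and sums of squared deviations.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


x = [1, 2, 3, 4, 5]

# Y rises by the same amount every time X rises: perfect positive.
perfect_positive = pearson_r(x, [10, 20, 30, 40, 50])

# Y falls by the same amount every time X rises: perfect negative.
perfect_negative = pearson_r(x, [50, 40, 30, 20, 10])

# Y falls and then rises (a U shape): no straight-line trend at all.
no_linear_trend = pearson_r(x, [4, 1, 0, 1, 4])

print(perfect_positive)  # 1.0
print(perfect_negative)  # -1.0
print(no_linear_trend)   # 0.0
```

The last case is worth noticing: the Y values follow a clear U-shaped pattern, yet r is 0.00, because Pearson's r only measures how well a straight line describes the data.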

Larger Correlation Coefficients Mean Stronger Relationships

Most correlation coefficients (assuming there really is a relationship between the two variables you are examining) tend to be somewhat lower than plus or minus 1.00 (meaning that they are not perfect relationships) but are somewhat above 0.00. Remember that a correlation coefficient of 0.00 means that there is no relationship between your two variables based on the data you are looking at. The closer a correlation coefficient is to 0.00, the weaker the relationship is and the less able you are to tell exactly what happens to one variable based on knowledge of the other variable. The closer a correlation coefficient is to plus or minus 1.00, the stronger the relationship is and the more accurately you are able to predict what happens to one variable based on the knowledge you have of the other variable.

Making Statistical Inferences from Pearson’s r:

How do you determine whether a correlation is simply a chance occurrence or really is true of the population? There is an additional step you can take to determine the "significance" of your correlation coefficient. Just like other statistical tests, the test of a correlation's significance weighs two hypotheses:

Null Hypothesis:

"There is not a significant correlation between the two variables; any observed trend or relationship may be due to chance and sampling error."

For example:  

  • There is no significant correlation between the number of students missing class and the weather outside; any relationship we see in the data may be due to chance and sampling error.

Alternative Hypothesis:

"There is a significant correlation between the two variables; the observed trend or relationship is most likely not due to chance or sampling error."

For example:  

  • There is a significant correlation between the number of years spent smoking and the incidence of lung cancer; the positive relationship we see in the data is most likely not due to chance or sampling error.

You will need three things in order to determine whether you can infer that the relationship you found in your sample is significant (in other words, that it generalizes to the larger population):

  1. The Correlation Coefficient that you calculated    
  2. Something called the “degrees of freedom” which is simply the number of pairs of data in your sample minus 2. 
  3. A table of “Critical Values” of the correlation coefficient.

The first thing you need to do is look down the degrees of freedom column until you see the row with the number of degrees of freedom that matches your sample degrees of freedom.  Look across to the number listed under 0.05 level of significance.  This number is called “the critical value of r”.

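Draw your conclusion by comparing the calculated and critical r values: if the absolute value of your calculated r is greater than or equal to the critical value, reject the null hypothesis and conclude that the correlation is significant. If it is smaller than the critical value, you fail to reject the null hypothesis.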
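The lookup and comparison can be sketched in Python. This is only an illustration: the dictionary holds a short excerpt of commonly tabulated two-tailed critical values at the 0.05 level, the function name `is_significant` is made up, and the example r and sample size are hypothetical. Check the values against whichever table your course uses, since a one-tailed table will differ.

```python
# Excerpt of two-tailed critical values of r at the 0.05 level,
# keyed by degrees of freedom (number of pairs minus 2).
# Assumption: a two-tailed test; confirm against your own table.
CRITICAL_R_05 = {
    1: 0.997, 2: 0.950, 3: 0.878, 4: 0.811, 5: 0.754,
    6: 0.707, 7: 0.666, 8: 0.632, 9: 0.602, 10: 0.576,
}


def is_significant(r, n_pairs, table=CRITICAL_R_05):
    """Return (significant, df, critical_r) for a calculated r."""
    df = n_pairs - 2  # degrees of freedom
    if df not in table:
        raise KeyError(f"no critical value stored for df = {df}")
    critical_r = table[df]
    # Compare the size of r, ignoring its sign, with the critical value.
    return abs(r) >= critical_r, df, critical_r


# Hypothetical result: r = 0.775 calculated from 5 pairs of scores.
significant, df, critical_r = is_significant(0.775, 5)
print(df, critical_r, significant)  # 3 0.878 False
```

Here a fairly large r of 0.775 still falls short of the critical value, because with only 5 pairs (3 degrees of freedom) a correlation that size can easily arise by chance.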

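Calculating Correlation in Google Sheets

Google Sheets has a built-in CORREL function that returns Pearson's r for two ranges of equal length. For example, if the X values are in cells A2:A11 and the Y values are in B2:B11, enter =CORREL(A2:A11, B2:B11) in an empty cell.

Using Excel to calculate the Correlation Coefficient

Excel provides the same CORREL function and it takes the same two ranges.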

Performing a Correlation Coefficient test with the TI-83/84

  1. Press the STAT button on the calculator.
  2. Select option 4 to clear any past lists of data.
  3. Select option 1 to EDIT your lists.
  4. Enter your data for each variable as Lists.
  5. Press the STAT button and use the arrow key to move over to the CALC option.
  6. Scroll down to option 4, LinReg(ax+b), press 4, and hit ENTER.
  7. Press VARS. Scroll down to 5: STATISTICS, press 5, and hit ENTER.
  8. Scroll over to EQ. Then scroll down to 7:r and press 7. Press ENTER again to get the correlation coefficient.
